ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics
As machine translation (MT) metrics improve their correlation with human
judgement every year, it is crucial to understand the limitations of such
metrics at the segment level. Specifically, it is important to investigate
metric behaviour when facing accuracy errors in MT because these can have
dangerous consequences in certain contexts (e.g., legal, medical). We curate
ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging
from simple perturbations at the word/character level to more complex errors
based on discourse and real-world knowledge. We use ACES to evaluate a wide
range of MT metrics including the submissions to the WMT 2022 metrics shared
task and perform several analyses leading to general recommendations for metric
developers. We recommend: a) combining metrics with different strengths, b)
developing metrics that give more weight to the source and less to
surface-level overlap with the reference and c) explicitly modelling additional
language-specific information beyond what is available via multilingual
embeddings.
Comment: preprint for WMT 202
ACES: Translation Accuracy Challenge Sets at WMT 2023
We benchmark the performance of segment-level metrics submitted to WMT 2023
using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists
of 36K examples representing challenges from 68 phenomena and covering 146
language pairs. The phenomena range from simple perturbations at the
word/character level to more complex errors based on discourse and real-world
knowledge. For each metric, we provide a detailed profile of performance over a
range of error categories as well as an overall ACES-Score for quick
comparison. We also measure the incremental performance of the metrics
submitted to both WMT 2023 and 2022. We find that 1) there is no clear winner
among the metrics submitted to WMT 2023, and 2) performance change between the
2023 and 2022 versions of the metrics is highly variable. Our recommendations
are similar to those from WMT 2022. Metric developers should focus on: building
ensembles of metrics from different design families, developing metrics that
pay more attention to the source and rely less on surface-level overlap, and
carefully determining the influence of multilingual embeddings on MT
evaluation.
Comment: Camera Ready WMT 2023. arXiv admin note: text overlap with
arXiv:2210.1561
Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking
Recent progress in task-oriented neural dialogue systems is largely focused
on a handful of languages, as annotation of training data is tedious and
expensive. Machine translation has been used to make systems multilingual, but
this can introduce a pipeline of errors. Another promising solution is using
cross-lingual transfer learning through pretrained multilingual models.
Existing methods train multilingual models with additional code-mixed task data
or refine the cross-lingual representations through parallel ontologies. In
this work, we enhance the transfer learning process by intermediate fine-tuning
of pretrained multilingual models, where the multilingual models are fine-tuned
with different but related data and/or tasks. Specifically, we use parallel and
conversational movie subtitles datasets to design cross-lingual intermediate
tasks suitable for downstream dialogue tasks. We use only 200K lines of
parallel data for intermediate fine-tuning which is already available for 1782
language pairs. We test our approach on the cross-lingual dialogue state
tracking task for the parallel MultiWoZ (English -> Chinese, Chinese ->
English) and Multilingual WoZ (English -> German, English -> Italian) datasets.
We achieve improvements of more than 20% in joint goal accuracy over the
vanilla baseline on the parallel MultiWoZ dataset, using only 10% of the
target-language task data, and on the Multilingual WoZ dataset in a zero-shot
setup.
Comment: EMNLP 2021 Camera Ready
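The abstract above reports its gains in joint goal accuracy. As a minimal sketch of how that measure is commonly computed for dialogue state tracking (the function name and the slot/value representation here are illustrative assumptions, not taken from the paper), a turn counts as correct only if every predicted slot value matches the gold state:

```python
from typing import Dict, List

# Hypothetical representation: one dict per dialogue turn, slot name -> value.
DialogueState = Dict[str, str]


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state matches the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn is correct only when all slots and values agree (exact dict match).
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)
```

Because a single wrong slot makes the whole turn count as incorrect, this measure is stricter than per-slot accuracy.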
Extrinsic Evaluation of Machine Translation Metrics
Automatic machine translation (MT) metrics are widely used to distinguish the
translation qualities of machine translation systems across relatively large
test sets (system-level evaluation). However, it is unclear if automatic
metrics are reliable at distinguishing good translations from bad translations
at the sentence level (segment-level evaluation). In this paper, we investigate
how useful MT metrics are at detecting the success of a machine translation
component when placed in a larger platform with a downstream task. We evaluate
the segment-level performance of the most widely used MT metrics (chrF, COMET,
BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state
tracking, question answering, and semantic parsing). For each task, we only
have access to a monolingual task-specific model. We calculate the correlation
between the metric's ability to predict a good/bad translation and the
success/failure on the final task for the Translate-Test setup. Our experiments
demonstrate that all metrics exhibit negligible correlation with the extrinsic
evaluation of the downstream outcomes. We also find that the scores provided by
neural metrics are not interpretable, mostly because their ranges are undefined. We
synthesise our analysis into recommendations for future MT metrics to produce
labels rather than scores for more informative interaction between machine
translation and multilingual language understanding.
Comment: ACL 2023 Camera Ready
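The abstract above describes correlating a metric's good/bad translation judgement with success/failure on the downstream task. One way to sketch this, assuming the metric score is turned into a binary label with a threshold and that the Matthews correlation coefficient is the statistic (both are assumptions for illustration; the abstract names neither), is:

```python
import math
from typing import List, Sequence


def binarise(scores: Sequence[float], threshold: float) -> List[bool]:
    """Label a translation 'good' when its metric score reaches the threshold.

    The threshold is a hypothetical parameter chosen by the user.
    """
    return [s >= threshold for s in scores]


def matthews_corrcoef(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Matthews correlation between two binary label sequences, in [-1, 1]."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Here `predicted` would hold the thresholded metric labels and `actual` the per-segment task outcomes (True when the downstream model succeeded on the translated input). A value near zero corresponds to the negligible correlation the abstract reports.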
Multi3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue
Task-oriented dialogue (ToD) systems have been widely deployed in many industries as they deliver more efficient customer support. These systems are typically constructed for a single domain or language and do not generalise well beyond this. To support work on Natural Language Understanding (NLU) in ToD across multiple languages and domains simultaneously, we constructed Multi3NLU++, a multilingual, multi-intent, multi-domain dataset. Multi3NLU++ extends the English-only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages (Spanish, Marathi, Turkish and Amharic), in two domains (banking and hotels). Because of its multi-intent property, Multi3NLU++ represents complex and natural user goals, and therefore allows us to measure the realistic performance of ToD systems in a varied set of the world's languages. We use Multi3NLU++ to benchmark state-of-the-art multilingual models for the NLU tasks of intent detection and slot labeling for ToD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting, offering ample room for future experimentation in multi-domain multilingual ToD setups.
MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue
Task-oriented dialogue (TOD) systems have been widely deployed in many
industries as they deliver more efficient customer support. These systems are
typically constructed for a single domain or language and do not generalise
well beyond this. To support work on Natural Language Understanding (NLU) in
TOD across multiple languages and domains simultaneously, we constructed
MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++
extends the English-only NLU++ dataset to include manual translations into a
range of high, medium, and low resource languages (Spanish, Marathi, Turkish
and Amharic), in two domains (BANKING and HOTELS). Because of its multi-intent
property, MULTI3NLU++ represents complex and natural user goals, and therefore
allows us to measure the realistic performance of TOD systems in a varied set
of the world's languages. We use MULTI3NLU++ to benchmark state-of-the-art
multilingual models for the NLU tasks of intent detection and slot labelling
for TOD systems in the multilingual setting. The results demonstrate the
challenging nature of the dataset, particularly in the low-resource language
setting, offering ample room for future experimentation in multi-domain
multilingual TOD setups.
Comment: ACL 2023 (Findings) Camera Ready